Abstract

We propose a method to infer causal structures containing both dis- 
crete and continuous variables. The idea is to select causal hypotheses 
for which the conditional density of every variable, given its causes, be- 
comes smooth. We define a family of smooth densities and conditional 
densities by second order exponential models, i.e., by maximizing con- 
ditional entropy subject to first and second statistical moments. If 
some of the variables take only values in proper subsets of $\mathbb{R}^n$, these
conditionals can induce different families of joint distributions even for 
Markov-equivalent graphs. 

We consider the case of one binary and one real-valued variable 
where the method can distinguish between cause and effect. Using 
this example, we show that a causal hypothesis must sometimes be
rejected because P(effect|cause) and P(cause) share algorithmic
information (which is untypical if they are chosen independently). This
way, our method is in the same spirit as faithfulness-based causal in- 
ference because it also rejects non-generic mutual adjustments among 
DAG-parameters. 

1 Introduction 

Finding causal structures that generated the statistical dependences among 
observed variables has attracted increasing interest in machine learning.
Although there is in principle no method for reliably identifying causal
structures if no randomized studies are feasible, the seminal work of Spirtes et
al. [1] and Pearl [2] made it clear that under reasonable assumptions it is
possible to derive causal information from purely observational data. 

The formal language of the conventional approaches is a graphical model, 
where the random variables are the nodes of a directed acyclic graph (DAG) 
and an arrow from variable X to Y indicates that there is a direct causal 
influence from X to Y . The definition of "direct causal effect" from X to Y 
refers to a hypothetical intervention where all variables in the model except
X and Y are adjusted to fixed values and one observes whether the
distribution of Y changes while X is adjusted to different values. As clarified
in detail in [2], the change of the distribution of Y in such an intervention
can be derived from the joint distribution of all relevant variables after the 
causal DAG is given. 

The essential postulate that connects statistics to causality is the so-called
causal Markov condition, stating that every variable is conditionally
independent of its non-effects, given its direct causes [2]. If the joint
distribution of $X_1, \dots, X_n$ has a density $p(x_1, \dots, x_n)$ with respect to some
product measure $\mu$ (which we assume throughout the paper), the latter
factorizes [3] into

$$p(x_1, \dots, x_n) = \prod_{j=1}^{n} p(x_j \mid pa_j) , \tag{1}$$

where $pa_j$ is the set of all values of the parents of $X_j$ with respect to the
true causal graph. The conditional densities $p(x_j \mid pa_j)$ will be called Markov
kernels. They represent the mechanisms that generate the statistical
dependences.

A large class of known causal inference algorithms (for instance,
PC, IC, and FCI, see [1, 2]) is based on the causal faithfulness principle, which
reads: among all graphs that render the joint distribution Markovian, prefer
those structures that allow only the observed conditional dependences. In
other words, faithfulness is based on the assumption that all the observed
independences are due to the causal structure rather than being a result of
specific adjustments of parameters. One of the main limitations of this type
of independence-based causal inference is that there are typically a large
number of DAGs that induce the same set of independences. Rules for the
selection of hypotheses within these Markov equivalence classes are therefore
desirable.

Before we describe our method, we briefly sketch some methods from
the literature. [5] have observed that linear causal relationships between
non-Gaussian distributed random variables induce joint measures which
require non-linear cause-effect relations for the wrong causal directions. Their
causal inference principle [6] for linear non-Gaussian acyclic models (short:
LiNGAM) is based on independent component analysis$^1$. It selects causal
hypotheses for which almost linear cause-effect relations are sufficient
whenever such hypotheses are possible for a given distribution$^2$. [8] generalized
this idea to the case where every variable is a (possibly non-linear) function
of its direct causes up to some additive noise term that is independent of the
causes (see also [9] and [10]). Under this assumption, different causal
structures induce, in the generic case, different classes of joint distributions even
if the causal graphs belong to the same equivalence class. [11, 12] generalized
the model class to the case where every function is additionally subjected to
non-linear distortion compared to the models of [8]. However, all these
algorithms only work for real-valued variables and the generalization to discrete
variables is not straightforward.

Here we describe a method (first proposed in our conference paper [13])
that can deal with combinations of discrete and continuous variables; it even
benefits from such a combination. More precisely, it requires that at least
one of the variables is discrete or attains only values in a proper subset of $\mathbb{R}^n$.
We define a parametric family of conditionals that induce different families 
of joint distributions for different causal directions. The underlying idea is 
related to an observation of [14] stating that the same joint distribution of
combinations of discrete and continuous variables may have descriptions in 
terms of simple Markov kernels for one DAG but require more complex ones 
for other DAGs. In contrast to [2], we define families of Markov kernels 
that are derived from a unique principle, regardless of whether the variables 
are discrete or continuous. 

To describe our idea, assume that X is a binary variable and Y is real-valued
and that we observe the joint distribution shown in Fig. 1. Let $p(y)$
be a bimodal mixture of two Gaussians such that both $p(y \mid x = 0)$ and
$p(y \mid x = 1)$ are Gaussians with the same width but different means. Then it
is natural to assume that X is the cause and Y the effect because changing
the value of X would then simply shift the mean of Y. For the converse
model $Y \to X$, the bimodality of Y remains unexplained. Moreover, it seems
unlikely that conditioning on the effect X separates the two modes of $p(y)$
even though X is not causally responsible for the bimodality.

To show that there are also joint distributions where $Y \to X$ is more
natural, assume that $p(y)$ is Gaussian and the supports of $p(y \mid x = 0)$ and
$p(y \mid x = 1)$ are $(-\infty, y_0]$ and $[y_0, \infty)$, respectively, as shown in Fig. 2. One

1 It was implemented in Matlab by P. O. Hoyer, available at
http://www.cs.helsinki.fi/group/neuroinf/lingam/

2 Apart from this, it has been shown that the linearity assumption also helps for causal
inference in the presence of latent variables (see, e.g., [7]).

Figure 1: Left: Joint density $p(x, y)$ of a real-valued variable Y and a binary
variable X suggesting a model $X \to Y$, because the influence of X on Y then only
consists of shifting the mean of the Gaussian. The causal hypothesis $Y \to X$ is less
likely: only specific choices of $p(x \mid y)$ would separate the bimodal Gaussian mixture
$p(y)$ (right) into two separate modes. It requires an even more specific conditional
$p(x \mid y)$ to make the components Gaussian.




Figure 2: Joint density $p(x, y)$ of a real-valued random variable Y and a binary
variable X. The marginal distribution $p(y)$ is Gaussian. The causal hypothesis
$Y \to X$ is plausible: the conditional $p(x \mid y)$ corresponds to setting x = 1 for all
y above a certain threshold. We reject the converse hypothesis $X \to Y$ because
$p(y \mid x)$ and $p(x)$ share algorithmic information: given $p(y \mid x)$, only specific choices
of $p(x)$ reproduce the Gaussian $p(y)$, whereas generic choices of $p(x)$ would yield
"odd" densities of the type on the right.

can easily think of a causal mechanism whose output x is 1 for all inputs
y above a certain threshold $y_0$, and 0 otherwise. Assuming $X \to Y$, we
would require a mechanism that generates outputs y from inputs x according
to $p(y \mid x)$. Given this mechanism, there is only one distribution $p(x)$ of
inputs for which $p(y)$ is Gaussian. Hence, the generation of the observed
joint distribution requires mutual adjustments of parameters for this causal
model.

In Section 4 we will describe more formal arguments that support this
way of reasoning. They are based on ideas in [15] and [16] for drawing causal
conclusions not only from statistical dependences: algorithmic
information can also indicate causal directions.



2 Causal inference using second order exponential 
models 

Here we define a parametric family of Markov kernels $p(x_j \mid pa_j)$ that describe
a simple way in which $X_j$ is influenced by its parents $PA_j$. Without loss of
generality, we will only consider complete acyclic graphs, i.e., the parents
of $X_j$ are given by $X_1, \dots, X_{j-1}$ (the general case is implicitly included by
setting the corresponding parameters to zero).

The domains $T_j$ of $X_j$ are subsets of $\mathbb{R}^{n_j}$ with integer Hausdorff
dimension [17] $d_j$, i.e., we exclude fractal subsets. In Sections 3 and 5 we will
consider, for instance, intervals in $\mathbb{R}$, circles in $\mathbb{R}^2$, and countable subsets of
$\mathbb{R}$. We define

$$p(x_1) := \exp\left( \alpha_1^T x_1 + x_1^T \beta_{11} x_1 - z_1 \right)$$
$$p(x_j \mid x_1, \dots, x_{j-1}) := \exp\Big( \alpha_j^T x_j + x_j^T \sum_{i \le j} \beta_{ji} x_i - z_j(x_1, \dots, x_{j-1}) \Big) , \tag{2}$$

with vector-valued parameters $\alpha_j$ and matrix-valued parameters $\beta_{ji}$. The
log-partition functions $z_j$ are given by

$$z_j(x_1, \dots, x_{j-1}) := \log \int \exp\Big( \alpha_j^T x_j + x_j^T \sum_{i \le j} \beta_{ji} x_i \Big) \, d\mu_j(x_j) ,$$

where the reference measure $\mu$ is given by the product of the Hausdorff
measures $\mu_j$ [17] of the corresponding dimensions $d_j$, and only parameters are
allowed that yield normalizable densities. The term "Hausdorff measure"
only formalizes the natural intuition of a volume of sufficiently well-behaved
subsets of $\mathbb{R}^n$: for a circle, for instance, it is given by the arc length; for
countable subsets it is just the counting measure.
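As a minimal sketch of how the log-partition function can be evaluated once the domain is discretized, the following code handles a scalar $x_j$. The names `log_partition`, `grid`, `weights` and `lin` are illustrative assumptions and not part of the original method description; `lin` stands for the parents' contribution $\sum_{i<j} \beta_{ji} x_i$, and `weights` plays the role of the reference measure (counting measure for a binary domain, grid spacing for an interval).

```python
import math

def log_partition(alpha, beta_jj, lin, grid, weights):
    """log sum_x w(x) * exp((alpha + lin) * x + beta_jj * x**2) over a grid.

    For scalar x_j this is the discretized version of z_j, with
    lin = sum_{i<j} beta_ji * x_i (hypothetical argument name).
    """
    exps = [(alpha + lin) * x + beta_jj * x * x + math.log(w)
            for x, w in zip(grid, weights)]
    m = max(exps)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(e - m) for e in exps))

# Binary domain {0, 1} with counting measure:
# z = log(1 + exp(alpha + lin + beta_jj)).
z_binary = log_partition(0.3, 0.2, 0.5, [0.0, 1.0], [1.0, 1.0])

# Interval [-8, 8] approximating the real line (Riemann sum);
# beta_jj < 0 is needed for a normalizable density on R.
step = 0.01
grid = [-8.0 + step * k for k in range(1601)]
z_real = log_partition(1.0, -0.5, 0.0, grid, [step] * len(grid))
```

For the real-valued case the result can be compared with the closed form $\log\big(\sqrt{2\pi}\, e^{1/2}\big)$, which is the Gaussian integral of $\exp(x - x^2/2)$.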

For every reordering $\pi$ of the variables, the second order conditionals define
a family of joint distributions $\mathcal{P}_\pi$. The key observation on which our method
relies is that $\mathcal{P}_\pi$ and $\mathcal{P}_{\pi'}$ need not coincide if some of the variables $X_j$ have
domains $T_j$ that are proper subsets of $\mathbb{R}^{n_j}$ (if, for instance, all $X_j$ can attain
all values in $\mathbb{R}$, then $\mathcal{P}_\pi$ is the set of all non-degenerate n-variate Gaussians
for all $\pi$ and we cannot give preference to any causal ordering).

Our inference rule reads: if there are causal orders $\pi$ for which the observed
density p is in $\mathcal{P}_\pi$, prefer them to orderings $\pi'$ for which $p \notin \mathcal{P}_{\pi'}$. To
apply this idea to finite data where p is not available, we prefer the orderings
$\pi$ for which the Kullback-Leibler distance between the empirical distribution
and $\mathcal{P}_\pi$ is minimized, i.e., the likelihood of the data is maximized.

In Section 5 we discuss experiments with just two variables X, Y. We
have several cases where X is binary and Y real-valued, and one example 
where X is two-dimensional and attains values on a circle and Y is real- 
valued. This shows that also causal structures containing only continuous 
variables can be dealt with by our method when some of the domains are 
restricted. We now describe the algorithm. 

Second order model causal inference 

1. Given an m × n matrix of observations

2. Let $X_1, \dots, X_n$ be an ordering $\pi$ of the variables.

3. In the jth step, compute $p(x_j \mid x_1, \dots, x_{j-1})$ by minimizing the conditional
negative log-likelihood



To compute the partition function numerically, we discretize and bound 
the domain to a finite set of points. 

4. Compute the corresponding joint density $p_\pi(x_1, \dots, x_n)$ and its total
negative log-likelihood




with 







5. Select the causal orderings for which $L_\pi$ is minimal. This can be a
unique ordering or a set of orderings because not all orderings induce
different families of joint distributions and because values $L_\pi$ and
$L_{\pi'}$ are considered equal if their difference is below a certain threshold.
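The steps above can be sketched for the two-variable case with binary X and real-valued Y. This is an illustration under stated assumptions, not the implementation used for the experiments: the total negative log-likelihood is taken to be $L_\pi = -\sum_i \log p_\pi(x^{(i)}, y^{(i)})$, the Gaussian parts are fitted in closed form, and the sigmoid conditional is fitted by plain Newton steps. All function names and the default threshold are hypothetical. The sigmoid is parametrized as $1/(1 + e^{-(ay + b)})$, which corresponds to eq. (6) with $a = 2\alpha$ and $b = 2\beta$.

```python
import math
import random

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def nll_x_to_y(xs, ys):
    """Minimal negative log-likelihood for 'binary X -> real Y':
    Bernoulli p(x) times Gaussians p(y|x) with a shared width, as in eq. (5)."""
    m = len(xs)
    gamma = sum(xs) / m
    mean = {j: sum(y for x, y in zip(xs, ys) if x == j) / xs.count(j)
            for j in (0, 1)}
    var = sum((y - mean[x]) ** 2 for x, y in zip(xs, ys)) / m
    nll_x = -sum(math.log(gamma if x == 1 else 1.0 - gamma) for x in xs)
    return nll_x + 0.5 * m * (math.log(2 * math.pi * var) + 1.0)

def nll_y_to_x(xs, ys, steps=30):
    """Minimal negative log-likelihood for 'real Y -> binary X':
    Gaussian p(y) times a sigmoid p(x=1|y), as in eq. (6)."""
    m = len(ys)
    mu = sum(ys) / m
    var = sum((y - mu) ** 2 for y in ys) / m
    nll_y = 0.5 * m * (math.log(2 * math.pi * var) + 1.0)
    a = b = 0.0  # p(x=1|y) = 1 / (1 + exp(-(a*y + b)))
    for _ in range(steps):  # Newton steps on the logistic likelihood
        ps = [math.exp(-softplus(-(a * y + b))) for y in ys]
        ws = [p * (1.0 - p) for p in ps]
        ga = sum((x - p) * y for x, p, y in zip(xs, ps, ys))
        gb = sum(x - p for x, p in zip(xs, ps))
        haa = sum(w * y * y for w, y in zip(ws, ys)) + 1e-9  # tiny ridge
        hab = sum(w * y for w, y in zip(ws, ys))
        hbb = sum(ws) + 1e-9
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    nll_x = sum(softplus(a * y + b) - x * (a * y + b) for x, y in zip(xs, ys))
    return nll_y + nll_x

def preferred_direction(xs, ys, threshold=1.0):
    """Step 5: pick the ordering with the smaller L, or neither if close."""
    l_xy, l_yx = nll_x_to_y(xs, ys), nll_y_to_x(xs, ys)
    if abs(l_xy - l_yx) < threshold:
        return "undecided"
    return "X -> Y" if l_xy < l_yx else "Y -> X"

# Synthetic data from a second order model X -> Y (Gaussian mixture, cf. Fig. 1).
rng = random.Random(0)
xs = [1 if rng.random() < 0.5 else 0 for _ in range(2000)]
ys = [rng.gauss(2.0 if x == 1 else -2.0, 1.0) for x in xs]
direction = preferred_direction(xs, ys)
```

On this synthetic sample the Gaussian fit to the bimodal marginal of Y is poor, so the ordering with X first attains the smaller negative log-likelihood. For perfectly thresholded data the sigmoid parameters diverge, and a more careful optimizer than this sketch would be needed.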

For a preliminary justification of the approach, we recall that conditionals
of this kind arise from maximizing the conditional Shannon entropy
$S(X_j \mid X_1, \dots, X_{j-1})$ subject to $P(X_1, \dots, X_{j-1})$ and subject to the given
first and second moments [18] (for more details see also [19]):

$$E(X_j) = c_j \tag{3}$$
$$E(X_j X_i) = d_{ij} , \tag{4}$$

where $E(Z)$ denotes the expected value of a variable Z. For multi-dimensional
$X_j$, the $c_j$ are vectors and the $d_{ij}$ are matrices. Bilinear constraints are the
simplest constraints for which the entropy maximization yields interactions 
between the variables Xj (apart from this, linear constraints would not 
yield normalizable densities for unbounded domains). In this sense, second 
order models generate the simplest non-trivial family of conditional densi- 
ties within a hierarchy of exponential models [20] that are given by entropy 
maximization subject to higher order moments. 
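The link between the moment constraints (3), (4) and the exponential form (2) can be made explicit by a short Lagrangian calculation. The following is a sketch for scalar variables, writing $x_{<j}$ for $x_1, \dots, x_{j-1}$ and using a multiplier function $\kappa(x_{<j})$ for the normalization of the conditional; both notations are introduced here only for this illustration.

```latex
\begin{align*}
\mathcal{L} &= -\int p(x_{<j})\, p(x_j \mid x_{<j}) \log p(x_j \mid x_{<j}) \, d\mu
   + \alpha_j \big( E(X_j) - c_j \big)
   + \sum_{i \le j} \beta_{ji} \big( E(X_j X_i) - d_{ij} \big) \\
 &\quad + \int p(x_{<j})\, \kappa(x_{<j})
   \Big( \int p(x_j \mid x_{<j}) \, d\mu_j(x_j) - 1 \Big) d\mu(x_{<j}) , \\[4pt]
0 &= \frac{\delta \mathcal{L}}{\delta p(x_j \mid x_{<j})}
   = p(x_{<j}) \Big( -\log p(x_j \mid x_{<j}) - 1 + \alpha_j x_j
     + x_j \sum_{i \le j} \beta_{ji} x_i + \kappa(x_{<j}) \Big) , \\[4pt]
\log p(x_j \mid x_{<j}) &= \alpha_j x_j + x_j \sum_{i \le j} \beta_{ji} x_i - z_j(x_{<j}) ,
   \qquad z_j(x_{<j}) := 1 - \kappa(x_{<j}) .
\end{align*}
```

The multipliers $\alpha_j$ and $\beta_{ji}$ thus become the parameters of the second order conditional, and the normalization multiplier becomes the log-partition function.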

[19] provides a thermodynamic justification of second order models. The 
paper describes models of interacting physical systems, where the joint dis- 
tribution is given by first maximizing the entropy of the cause-system and 
then the conditional entropy of the effect-system, given the distribution of 
the cause. Both entropy maximizations are subject to energy constraints. If 
we assume that the physical energy is a polynomial of second order in the 
relevant observables (which is not unusual in physics), we obtain exactly the 
second order models introduced here. 

3 Identifiability results for special cases 

Here we describe examples that show how the restriction of the domains to
proper subsets of $\mathbb{R}^n$ can make the models identifiable. A case with
vector-valued variables has already been described in [13], where we have considered
the causal relation between the day in the year and the average temperature
of the day. The former takes values on a circle in $\mathbb{R}^2$, the latter is real-valued.
Second order models from day to temperature induce seasonal oscillations
of the average temperature according to a sine function, which was closer to
the truth than the second order model from temperature to day in the year.
However, in the following examples we will restrict our attention to
one-dimensional variables.



3.1 One binary and one real-valued variable

A simple case where cause and effect are identifiable in our model class is
already given by the motivating example with a binary variable X and a
variable Y that can attain all values in $\mathbb{R}$.

Second order model for $X \to Y$

Using both equations (2), we obtain

$$p(x = 1) = \gamma , \qquad p(y \mid x = j) = \frac{1}{\sqrt{2\pi}\,\rho} \, e^{-\frac{(y - \nu_j)^2}{2\rho^2}} , \tag{5}$$

with parameters $\gamma, \nu_0, \nu_1, \rho$. Both distributions $p(y \mid x = j)$ for j = 0, 1 are
obviously Gaussians with equal width and different means, i.e., $p(y)$ is a
mixture of two Gaussians (see Fig. 1, left).

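Second order model for $Y \to X$

We obtain

$$p(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y - \nu)^2}{2\sigma^2}} , \qquad p(x = 1 \mid y) = \frac{1}{2}\left(1 + \tanh(\alpha y + \beta)\right) , \tag{6}$$

with parameters $\nu, \sigma, \alpha, \beta$, where we have used

$$\frac{e^a}{1 + e^a} = \frac{1}{e^{-a} + 1} = \frac{1}{2}\left(1 + \tanh(a/2)\right) . \tag{7}$$

A typical joint distribution for the model $Y \to X$ is shown in Fig. 3.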

Since mixtures of two different Gaussians can never yield a Gaussian as 
marginal distribution p(y), the only joint distribution that is contained in 
the model classes for both directions is a product distribution of a Gaussian 
$p(y)$ and an arbitrary binary distribution $p(x)$. This shows that the models
are identifiable except for the trivial case of independence. 
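The claim that a mixture of two different Gaussians is never Gaussian can be checked numerically. The sketch below computes the Kullback-Leibler divergence between the mixture marginal of the $X \to Y$ model and the Gaussian with the same mean and variance, using a simple Riemann sum. The function names, grid width and step size are assumptions made for this illustration.

```python
import math

def gauss(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (math.sqrt(2 * math.pi) * sd)

def kl_mixture_vs_gaussian(gamma, nu0, nu1, rho, step=0.001, half_width=12.0):
    """KL(p || q), where p is the mixture (1-gamma) N(nu0, rho^2) + gamma N(nu1, rho^2)
    and q is the Gaussian with the same mean and variance as p."""
    mean = (1 - gamma) * nu0 + gamma * nu1
    sd = math.sqrt(rho ** 2 + gamma * (1 - gamma) * (nu1 - nu0) ** 2)
    kl = 0.0
    for k in range(int(2 * half_width / step) + 1):
        y = mean - half_width + k * step
        p = (1 - gamma) * gauss(y, nu0, rho) + gamma * gauss(y, nu1, rho)
        q = gauss(y, mean, sd)
        if p > 0.0:
            kl += p * math.log(p / q) * step
    return kl

kl_equal_means = kl_mixture_vs_gaussian(0.5, 0.0, 0.0, 1.0)   # degenerate mixture
kl_bimodal = kl_mixture_vs_gaussian(0.5, -2.0, 2.0, 1.0)      # two distinct modes
```

The divergence vanishes only when the two means coincide, i.e., when X and Y are independent; for distinct means it is strictly positive, which is the gap exploited by the likelihood comparison.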

Furthermore, eqs. (5) and (6) show that our method is indeed consistent
with the intuitive arguments we gave for the examples in the introduction:
the Gaussian mixture in Fig. 1 is a second order model for $X \to Y$ and the
example with thresholding y (Fig. 2) can be approximated by second order
models for $Y \to X$ via the limit $\alpha \to \infty$ in eq. (6).

Figure 3: Joint density $p(x, y)$ for binary X and real-valued Y induced by a second
order model $Y \to X$. Here we have chosen a relatively steep sigmoid function for
$p(x = 1 \mid y)$, which leads to a steep decrease at the right of the left mode and the left
of the right mode. An infinitely steep sigmoid function yields sharp thresholding
as in Fig. 2.



3.2 More than three binary variables 

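We first simplify equation (2) for the case that all variables $X_1, \dots, X_n$ are
binary. Writing $an_j := x_1, \dots, x_{j-1}$ for the ancestors of $X_j$, we obtain

$$p(x_j = 1 \mid an_j) = \frac{\exp\left(\alpha_j + \beta_{jj} + \sum_{i<j} \beta_{ji} x_i\right)}{1 + \exp\left(\alpha_j + \beta_{jj} + \sum_{i<j} \beta_{ji} x_i\right)} .$$

Using eq. (7) yields

$$p(x_j = 1 \mid an_j) = \frac{1}{2}\Big(1 + \tanh\Big(\lambda_j + \sum_{i=1}^{j-1} \lambda_{ji} x_i\Big)\Big) , \tag{8}$$

with

$$\lambda_j := \frac{1}{2}(\alpha_j + \beta_{jj}) \quad \text{and} \quad \lambda_{ji} := \frac{1}{2}\beta_{ji} \quad \text{for } i = 1, \dots, j-1 .$$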
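The equivalence of the logistic form and the tanh form in eq. (8) can be verified numerically. The following sketch uses arbitrary illustrative parameter values; the function names are hypothetical.

```python
import math
from itertools import product

def p_logistic(alpha_j, beta_jj, betas, parents):
    """p(x_j = 1 | an_j) in the exponential form obtained from eq. (2)."""
    a = alpha_j + beta_jj + sum(b * x for b, x in zip(betas, parents))
    return math.exp(a) / (1 + math.exp(a))

def p_tanh(alpha_j, beta_jj, betas, parents):
    """The same conditional in the tanh form of eq. (8)."""
    lam_j = 0.5 * (alpha_j + beta_jj)
    lam = [0.5 * b for b in betas]
    return 0.5 * (1 + math.tanh(lam_j + sum(l * x for l, x in zip(lam, parents))))

betas = [1.2, -0.7]  # illustrative values of beta_j1, beta_j2
max_diff = max(abs(p_logistic(0.4, -0.3, betas, xs) - p_tanh(0.4, -0.3, betas, xs))
               for xs in product((0, 1), repeat=2))
```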

The joint distributions induced by these conditionals do not coincide
for all causal orders provided that $n \ge 4$. To show this, we first observe
that second order models can approximate the causal relation between the
inputs and the output of an $(n-1)$-bit OR gate. Then we show that the
conditional probability for one input, given the other $n-2$ inputs and the
output, is significantly more complex than a second order model since it
requires polynomials of degree $n-2$ as argument of the tanh function (which
corresponds to polynomials of degree $n-1$ in the exponent in the same way
as second order models lead to linear arguments of tanh).

The OR gate with inputs $X_1, \dots, X_{n-1}$ and output $X_n$ is described by

$$p(x_n = 1 \mid an_n) = 1 - \prod_{i=1}^{n-1} (1 - x_i) .$$

Introducing a sequence of second order conditionals by

$$p_k(x_n = 1 \mid an_n) := \frac{1}{2}\Big(1 + \tanh\Big(-k + 2k \sum_{i=1}^{n-1} x_i\Big)\Big) ,$$

we have

$$\lim_{k \to \infty} p_k(x_n \mid an_n) = p(x_n \mid an_n) ,$$

and thus they approximate the OR gate.
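A quick numerical check of this limit for a 3-bit OR gate (n = 4) is sketched below. The function names are chosen for this illustration only.

```python
import math
from itertools import product

def or_gate(inputs):
    """Deterministic OR gate: 1 - prod_i (1 - x_i)."""
    prod = 1
    for x in inputs:
        prod *= 1 - x
    return 1 - prod

def p_k(inputs, k):
    """Second order conditional p_k(x_n = 1 | inputs)."""
    return 0.5 * (1 + math.tanh(-k + 2 * k * sum(inputs)))

# Largest deviation from the OR gate over all input configurations.
max_gap = {k: max(abs(p_k(x, k) - or_gate(x)) for x in product((0, 1), repeat=3))
           for k in (1, 5, 20)}
```

The worst case is attained when at most one input is 1, where the deviation equals $1/(1 + e^{2k})$ and therefore decays exponentially in k.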

Let the inputs $X_1, \dots, X_{n-1}$ be sampled from the uniform distribution
over $\{0, 1\}^{n-1}$. We then have

$$p(x_1 = 1 \mid x_2, x_3, \dots, x_{n-1}, x_n = 1) = \begin{cases} 1 & \text{for } x_2 = x_3 = \dots = x_{n-1} = 0 \\ 1/2 & \text{otherwise} \end{cases} \tag{9}$$

and

$$p(x_1 = 1 \mid x_2 = \dots = x_n = 0) = 0 . \tag{10}$$

Note that the event $X_n = 0$ and $X_i = 1$ for some $i \in \{2, \dots, n-1\}$ does
not occur and the corresponding conditional probabilities need not be
specified.

We now show that the joint distribution cannot be approximated by second
order models if $X_n$ is not the last node. For symmetry reasons, it is
sufficient to show that $p(x_1 \mid x_2, \dots, x_n)$ has no second order model
approximation. If such an approximation existed, we would have

$$p(x_1 = 1 \mid x_2, \dots, x_n) = \lim_{k \to \infty} \frac{1}{2}\left(1 + \tanh\left(q_k(x_2, \dots, x_n)\right)\right) , \tag{11}$$

where $q_k$ is a sequence of linear functions in $x_2, \dots, x_n$, see eq. (8). We prove
that eq. (11) can indeed be satisfied with $q_k$ being polynomials of order $n-2$,
but not for any sequence of polynomials of lower order (which shows that
these non-causal conditionals can be rather complex). Introducing

$$\tilde{q}_k(x_2, \dots, x_{n-1}) := q_k(x_2, \dots, x_{n-1}, x_n = 1) ,$$

eq. (9) is equivalent to

$$\lim_{k \to \infty} q_k(x_2, \dots, x_{n-1}, x_n = 1) = \begin{cases} \infty & \text{for } x_2 = x_3 = \dots = x_{n-1} = 0 \\ 0 & \text{otherwise .} \end{cases} \tag{12}$$

If the space of polynomials of degree $n-3$ or lower contained such a sequence
$\tilde{q}_k$, completeness of finite dimensional real vector spaces implies that it also
contained

$$g := \lim_{k \to \infty} \frac{1}{\|\tilde{q}_k\|} \, \tilde{q}_k ,$$

which is given by

$$g(x_2, \dots, x_{n-1}) = \begin{cases} 1 & \text{for } x_2 = x_3 = \dots = x_{n-1} = 0 \\ 0 & \text{otherwise .} \end{cases}$$

However,

$$g(x_2, \dots, x_{n-1}) = \prod_{i=2}^{n-1} (1 - x_i) ,$$

which is a polynomial of degree $n-2$. Hence, $\tilde{q}_k$ (and also $q_k$) consists at
least of polynomials of order $n-2$. To see that this bound is tight, set

$$q_k(x_2, \dots, x_n) := k \Big( 2(x_n - 1) + \prod_{i=2}^{n-1} (1 - x_i) \Big)$$

and observe that it satisfies eq. (12) and

$$\lim_{k \to \infty} q_k(x_2 = 0, \dots, x_n = 0) = -\infty ,$$

and thus the corresponding conditionals satisfy asymptotically eqs. (9) and (10).
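Both halves of this argument can be illustrated for n = 4. The sketch below assumes the polynomial $q_k = k\,(2(x_n - 1) + \prod_{i=2}^{n-1}(1 - x_i))$ and checks the limits required by eqs. (9) and (10). It then checks the elementary identity that rules out linear $q_k$ in this case: every affine function f on $\{0,1\}^2$ satisfies $f(0,0) + f(1,1) = f(0,1) + f(1,0)$, so it cannot diverge at (0,0) while tending to 0 at the three other points. Function names and coefficients are illustrative.

```python
import math
from itertools import product

def q(xs, k):
    """Polynomial of degree n - 2; xs = (x_2, ..., x_n), last entry is the output x_n."""
    *inputs, out = xs
    prod = 1
    for x in inputs:
        prod *= 1 - x
    return k * (2 * (out - 1) + prod)

def p_x1(xs, k):
    """Approximation of p(x_1 = 1 | x_2, ..., x_n) as in eq. (11)."""
    return 0.5 * (1 + math.tanh(q(xs, k)))

k = 50
limits = {xs: p_x1(xs, k) for xs in product((0, 1), repeat=3)}  # n = 4

def affine(c0, c2, c3):
    """An arbitrary affine function of (x_2, x_3)."""
    return lambda x2, x3: c0 + c2 * x2 + c3 * x3

fa = affine(0.7, -1.3, 2.1)
identity_gap = fa(0, 0) + fa(1, 1) - fa(0, 1) - fa(1, 0)
```

For large k the entry for $(x_2, x_3, x_4) = (0, 0, 1)$ approaches 1, the entries with $x_4 = 1$ and some other input equal to 1 stay at 1/2, and the entry for $(0, 0, 0)$ approaches 0.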

By inverting logical values, the same proof applies to AND gates. Since 
AND and OR gates are reasonable models for many causal relations in 
real-life, it is remarkable that the corresponding non-causal conditionals of 
the generated joint distribution already require exponential models of high 
order. Successful experiments with artificial and real-world data with four 
binary variables are briefly sketched in [21].

4 Justification of our method by algorithmic information theory



4.1 The principle of independent conditionals 

Since second order models provide a simple class of non-trivial conditional
densities, Occam's Razor seems to strongly support the principle of preferring
the direction that admits such a model. However, Occam's Razor
cannot justify why we should try to find simple expressions for the causal
conditional P(effect|cause) instead of simple models for non-causal
conditionals like P(cause|effect). Here we present a justification that is based
on recent approaches to causal inference that build on algorithmic
information theory.

[15] proposed to prefer those DAGs as causal hypotheses for which the
shortest description of the joint density $p(x_1, \dots, x_n)$ is given by separate
descriptions of the causal conditionals $p(x_j \mid pa_j)$ in eq. (1). We will refer to this as
the principle of independent conditionals (IC). Here, the description length
is measured in terms of algorithmic information [22, 23, 24], sometimes also
called "Kolmogorov complexity". Even though it is hard to give a precise
meaning to this principle, it provides the leading motivation for our theory.

To show this, we reconsider one of the examples from the introduction.
We have argued that the distribution in Fig. 2 is unlikely to be generated
by the causal structure $X \to Y$ because the observed distribution $p(x)$ is
special among all possible $p(x)$ since it is the only distribution that yields a
Gaussian marginal $p(y)$ after feeding it into the conditional $p(y \mid x)$. Hence,
after knowing $p(y \mid x)$, the input distribution $p(x)$ is simply described by
"the unique input that renders $p(y)$ Gaussian". Thus, a description for
$p(x, y)$ that contains separate descriptions of $p(y \mid x)$ and $p(x)$ would contain
redundant information and the IC principle would be violated. We propose a
slightly modified version of IC that will be more convenient to use because
it refers to algorithmic dependences between unconditional distributions:

Postulate 1 (independence of input and modified joint distr.)

If the joint density $p(x, y)$ is generated by the causal structure $X \to Y$ then
the following condition must hold:

Let $\tilde{p}(x)$ be a hypothetical input density that has been chosen without
knowing $p(x, y)$. Define $\tilde{p}(x, y) := p(y \mid x)\,\tilde{p}(x)$. Then $p(x)$ and $\tilde{p}(x, y)$ are
algorithmically independent.

The idea is that $\tilde{p}(x, y)$ only contains algorithmic information about
$p(y \mid x)$ and $\tilde{p}(x)$. The object $p(y \mid x)$ has been chosen independently of $p(x)$
"by nature", as in [15], and $\tilde{p}(x)$ has been chosen independently of $p(y \mid x)$
by assumption.

Due to the lack of a precise meaning of the concept of "algorithmic 
information of probability densities", as it would be required by [15] and 
our modified postulate, we will describe arguments that avoid such concepts 
but still rely on the above intuition. 

4.2 The framework for probability-free causal inference

We therefore rephrase the probability-free approach to causal inference de- 
veloped by [16] . The idea is that causal inference in real life often does not 
rely on statistical dependences. Instead, similarities between single objects 
indicate causal relations. Observing, for instance, that two carpets contain 
the same patterns makes us believe that designers have copied from each 
other (provided that the patterns are complex and not common). [16] de- 
velop a general framework for inferring causal graphs that connect individual 
objects based upon algorithmic dependences. Here, two objects are called 
algorithmically independent if their shortest joint description is given by 
the concatenations of their separate descriptions. It is assumed that every 
such description is a binary string s formalizing all relevant properties of an 
observation. Then the Kolmogorov complexity K(s) of s is defined by the 
length of the shortest program that generates the output s and then stops. 
Conditional Kolmogorov complexity $K(s \mid t)$ is defined as the length of the
shortest program that computes s from the input t. If $t^*$ denotes the shortest
compression of t, $K(s \mid t^*)$ can be smaller than $K(s \mid t)$ because there is no
algorithmic way to obtain the shortest compression (the difference between
$K(s \mid t)$ and $K(s \mid t^*)$ can at most be logarithmic in the length of t [25]).
The strings s and t are conditionally independent, given r, if

$$K(s, t \mid r) \approx K(s \mid r) + K(t \mid r) . \tag{13}$$

As in the statistical setting, unconditional dependences indicate causal links
between two objects s and t: if $K(s, t) \ll K(s) + K(t)$, the descriptions can
be better compressed jointly than independently and we postulate a causal
connection. The following terminology [26] will be crucial:

Definition 1 (Algorithmic mutual information)

For any two binary strings s, t, the difference

$$I(s : t) := K(s) + K(t) - K(s, t) \stackrel{+}{=} K(s) - K(s \mid t^*) \stackrel{+}{=} K(t) - K(t \mid s^*)$$

is called the algorithmic mutual information between s and t. As usual
in algorithmic information theory, the symbol $\stackrel{+}{=}$ denotes equality up to a
constant that is independent of the strings s, t, but does depend on the Turing
machine K(.) refers to.
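Kolmogorov complexity is uncomputable, so any numerical illustration has to replace K by a computable upper bound. The sketch below uses the length of a zlib-compressed string as such a crude stand-in; this substitution is an assumption made for illustration only and is not part of the formal framework. It mimics the carpet example: a copied complex pattern shares a large amount of information with the original, while two independently generated patterns share almost none.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for K."""
    return len(zlib.compress(data, 9))

def mutual_info_proxy(s: bytes, t: bytes) -> int:
    """Compression-based proxy for I(s : t) = K(s) + K(t) - K(s, t)."""
    return c(s) + c(t) - c(s + t)

rng = random.Random(0)
pattern = bytes(rng.randrange(256) for _ in range(4000))  # a "complex" pattern
other = bytes(rng.randrange(256) for _ in range(4000))    # an unrelated pattern

copied = mutual_info_proxy(pattern, pattern)   # large: the copy is redundant
unrelated = mutual_info_proxy(pattern, other)  # close to zero
```

As in the postulate, deciding when such a value counts as a significant dependence requires a threshold that the theory does not provide.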

To also infer causal directions we have postulated a causal Markov condition
stating conditional independence of every object from its non-effects,
given its causes. We will here state an equivalent version (see Theorem 3 in
[16]).

Postulate 2 (Algorithmic Markov condition)

Let G be a DAG with the binary strings $s_1, \dots, s_n$ as nodes. If every $s_j$ is
the description of an object or an observation in the real world and G formalizes
the causal relation between them, then the following condition must hold.
For any three sets $S, T, R \subset \{s_1, \dots, s_n\}$ we have

$$S \perp T \mid R^*$$

in the sense of eq. (13), whenever R d-separates S and T (for the notion of
d-separation see, e.g., [2]). Here we have slightly overloaded notation and
identified the set of strings with their concatenation.

Moreover, $R^*$ denotes the shortest compression of R. In particular, we
have

$$S \perp T ,$$

whenever S and T are d-separated (by the empty set). Here, the threshold
for counting dependences as significant is up to the decision of the researcher
and not provided by the theory.

4.3 Distinguishing between cause and effect 

Based on the above framework and inspired by the IC principle, [16] describe
the following approach to distinguishing between $X \to Y$ and $Y \to X$ for
two random variables X, Y after observing the samples $(x_1, y_1), \dots, (x_k, y_k)$.
One considers the causal structure among 2k + 2 individual objects instead
of a DAG with the two variables as nodes. These objects are: the x-values,
the y-values, the source S emitting x-values according to $p(x)$ and a
machine M emitting y-values according to $p(y \mid x)$. The causal DAG connecting
the objects is shown in Fig. 4, left. One may wonder why there are no arrows
from the x-values to M even though M gets them as inputs. The reason
is that the object M is not changed by the $x_j$, i.e., the conditional $p(y \mid x)$
remains constant.

Figure 4: Left: Causal structure obtained by resolving the statistical sample
generated by the causal structure $X \to Y$ into single observations. Right: Modified
structure where the input comes from a different source S' that samples according
to a different distribution $\tilde{p}(x)$. Note that $(x_1, x_2)$ and $(x_3, x_4, y_3, y_4)$ must be
algorithmically independent because there is no unblocked path between these two
sets of nodes.

[16] points out that the DAG in Fig. 4, left, already imposes
the algorithmic independence relation

$$x_1, \dots, x_m \perp y_{m+1}, \dots, y_k \mid x_{m+1}, \dots, x_k \qquad \forall m < k \tag{14}$$

and describes examples where this is violated after exchanging the roles of X
and Y. This is an observable implication of the algorithmic independence of
the unobservable objects S and M. The relevant information about S and M
is given by $p(x)$ and $p(y \mid x)$, respectively, hence condition (14) is closely linked
to Lemeire and Dirckx's postulate. [16] discusses toy examples for which
the distinction between $X \to Y$ and $Y \to X$ is possible using condition (14).
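The d-separation statements used here can be checked mechanically. The sketch below implements the standard test via the moralized ancestral graph and applies it to DAGs encoding our reading of Fig. 4 for k = 4 and m = 2: every $x_i$ has its source as only parent, and every $y_i$ has parents $x_i$ and M. The node names and the helper `sample_dag` are assumptions made for this illustration.

```python
def d_separated(parents, a, b, given=frozenset()):
    """True if node sets a and b are d-separated by `given` in the DAG
    described by `parents` (dict: node -> set of parent nodes)."""
    # 1. Restrict to the ancestral set of all nodes involved.
    relevant, stack = set(), list(a | b | given)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link each node to its parents and the parents to each other.
    nbrs = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
            nbrs[p].update(q for q in ps if q != p)
    # 3. Search for an undirected path from a to b that avoids `given`.
    seen, stack = set(), list(a)
    while stack:
        v = stack.pop()
        if v in seen or v in given:
            continue
        seen.add(v)
        stack.extend(nbrs[v])
    return not (seen & b)

def sample_dag(k, sources):
    """sources[i-1] names the source of x_i; every y_i has parents x_i and M."""
    parents = {}
    for i in range(1, k + 1):
        parents[f"x{i}"] = {sources[i - 1]}
        parents[f"y{i}"] = {f"x{i}", "M"}
    return parents

left = sample_dag(4, ["S"] * 4)
right = sample_dag(4, ["S", "S", "S'", "S'"])

cond_14 = d_separated(left, {"x1", "x2"}, {"y3", "y4"}, {"x3", "x4"})
uncond_left = d_separated(left, {"x1", "x2"}, {"y3", "y4"})
uncond_right = d_separated(right, {"x1", "x2"}, {"x3", "x4", "y3", "y4"})
```

In the left DAG the separation holds only after conditioning on the remaining x-values, as in condition (14), because all x-values share the source S. In the right DAG the two groups are separated unconditionally, as stated in the caption of Fig. 4.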

For our purposes, it will be more convenient to work with a slightly
different condition that can be seen as a finite-sample counterpart of
Postulate 1. To this end, we consider Fig. 4, right. Let $\mathbf{x}^1 := x_1, \dots, x_m$ be
the sample of x-values from source S (here m = 2). Let $\mathbf{x}^2 := x_{m+1}, \dots, x_k$
denote the x-values from source S' and $\mathbf{y}^2$ the corresponding y-values. The
d-separation criterion yields the unconditional relation

$$\mathbf{x}^1 \perp \mathbf{x}^2, \mathbf{y}^2 . \tag{15}$$

Of course, we do not assume that we have the option to really change the
input distribution from $p(x)$ to $\tilde{p}(x)$ (i.e., replacing the source S with S'),
otherwise we could directly test whether X causes Y by observing whether
such an intervention also changes $p(y)$. Our way of reasoning will be
indirect: given that the true causal structure is $X \to Y$, we could simulate
the effect of the intervention by choosing a subsample of x-values that is
distributed according to $\tilde{p}(x)$ and know that the corresponding pairs (x, y)
are distributed according to $p(y \mid x)\,\tilde{p}(x)$.

4.4 Applying the theory to second order models

We now describe the violation of condition (15) for the example in Fig. 2.
The true model $Y \to X$ involves the parameters $\nu$, $\sigma$, and $\beta$ (mean and
standard deviation of the Gaussian and threshold of y-values for which x = 1).
We therefore denote the corresponding density by $p_{\nu,\sigma,\beta}$. Now we consider
the non-causal conditional $p_{\nu,\sigma,\beta}(y \mid x)$. One checks easily that different triples
$(\nu, \sigma, \beta)$ indeed induce different $p(y \mid x)$. On the other hand, there is a unique
input probability $p_\gamma(x = 1) = \gamma$ such that the marginal

$$\sum_x p_{\nu,\sigma,\beta}(y \mid x) \, p_\gamma(x)$$

is Gaussian:

$$\gamma = f(\nu, \sigma, \beta) := \frac{1}{\sqrt{2\pi}\,\sigma} \int_{\beta}^{\infty} e^{-\frac{(y - \nu)^2}{2\sigma^2}} \, dy . \tag{16}$$

The fact that f does not involve any free parameters would already
be in contradiction with Lemeire and Dirckx's postulate if $X \to Y$ were
true because f only has constant description length and the parameters can
be described with arbitrary accuracy. For sufficiently high accuracy, the
description length of $\gamma$ thus exceeds the length of f and eq. (16) provides a
shorter description for $\gamma$ than its explicit binary representation.

According to our finite-sample point of view, the parameters $\gamma, \nu,
\sigma, \beta$ must only be described up to an accuracy that corresponds to
the error made when estimating them from a finite sample: Observing $x^1$,
i.e., an ensemble of $x$-values drawn from $p(x)$, we can estimate $\gamma$ up
to a certain accuracy. Similarly, we estimate $\nu, \sigma, \beta$ after
observing $x^2, y^2$ up to a certain accuracy. Then, the estimator for
$\gamma$ and the estimators of the other parameters will approximately
satisfy the functional relation (16). This shows that $x^1$, on the one hand,
and $x^2, y^2$ on the other hand, share at least the amount of algorithmic
information shared by the estimators because the latter have only processed
the information contained in the observations.
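The finite-sample argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values, the sample size, the convention that $x = 1$ exactly when $y \le \beta$, and the simple threshold estimator are all choices made here for illustration and are not taken from the paper.

```python
import math
import numpy as np

def gamma_from_params(nu, sigma, beta):
    """Relation (16): P(x = 1) = P(y <= beta) for y ~ N(nu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((beta - nu) / (sigma * math.sqrt(2.0))))

rng = np.random.default_rng(0)
nu, sigma, beta = 0.3, 1.2, 0.8   # hypothetical "true" parameters
k = 20000                         # hypothetical sample size

# Sample x^1: only the binary values are used.
y1 = rng.normal(nu, sigma, size=k)
x1 = (y1 <= beta).astype(int)     # assumed convention: x = 1 iff y <= beta
gamma_hat = x1.mean()

# Sample (x^2, y^2): an independent batch of pairs.
y2 = rng.normal(nu, sigma, size=k)
x2 = (y2 <= beta).astype(int)
nu_hat = y2.mean()
sigma_hat = y2.std(ddof=1)
# Threshold estimate: midpoint between the largest y with x = 1
# and the smallest y with x = 0.
beta_hat = 0.5 * (y2[x2 == 1].max() + y2[x2 == 0].min())

# The two estimates come from disjoint samples, yet they agree
# approximately because of the functional relation (16).
gamma_pred = gamma_from_params(nu_hat, sigma_hat, beta_hat)
print(gamma_hat, gamma_pred)
```

The estimate of $\gamma$ from the first sample and the value predicted from the second sample agree up to the estimation error, which is the sense in which the two samples share information.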

For general second order models, the argument reads as follows. For
$Y \to X$ we have parameter vectors $\alpha_Y$ and $\alpha_{X|Y}$ for $p(y)$
and $p(x|y)$, respectively. The joint distribution is determined by
$\alpha := (\alpha_Y, \alpha_{X|Y})$. Factorizing $p_\alpha(x,y)$ into the
non-causal conditionals $p(x)$ and $p(y|x)$ leads to families $p_\theta(x)$
and $p_\eta(y|x)$, where $p_\theta(x)p_\eta(y|x)$ is an element of the family
$p_\alpha(x,y)$ only if $\theta$ and $\eta$ satisfy a certain functional
relation and thus share algorithmic information. To translate this into
algorithmic dependences between (real and hypothetical) observations, we feed
$p_\eta(y|x)$ with a modified input distribution $p_{\theta'}(x)$ and observe
that the generated $(x,y)$-pairs still share algorithmic information with
those $x$-values that were sampled from the original input distribution
because $\eta$ can be estimated from the new $(x,y)$-pairs and $\theta$ from
the original $x$-values. If the parameters $\theta'$ and $\theta$ are
algorithmically independent, the sources $S$ and $S'$ in Fig. U, right, are
independent and the causal hypothesis $X \to Y$ implies independence of $x^1$
and $(x^2, y^2)$.

This way of reasoning can further be generalized as follows: assume that
$p(x,y)$ can be described by $p_\theta(x)p_\eta(y|x)$ where $p_\theta(x)$ and
$p_\eta(y|x)$ are some families of densities for which the map from
$\theta, \eta, x, y$ to the corresponding probabilities is a computable
function. Then $X \to Y$ can be rejected whenever one of the parameters is
determined by the other one via a computable function $f$, provided that the
Kolmogorov complexity of both parameters is infinite (which can, of course,
never be proved). The accuracy of estimating $\theta, \eta$ depends on the
statistical distinguishability of the (conditional) densities from those for
slightly modified $\theta + \Delta\theta$, $\eta + \Delta\eta$. Therefore,
Fisher information of parametric families plays a crucial role in the
following quantitative result:

Theorem 1 (dependent parameters violate the algorithmic MC)

Let $p_\theta(x)$ with $\theta \in \mathbb{R}^d$ and $p_\eta(y|x)$ with
$\eta \in \mathbb{R}^{\tilde d}$ be computable families of continuously
differentiable (conditional) densities. Define the Fisher information matrix
for $p_\theta(x)$ by

$$ (F_\theta)_{ij} := \int \frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\, p_\theta(x)\, d\mu_X(x) , $$

where $\mu_X$ denotes the Hausdorff measure corresponding to $X$. Define the
conditional Fisher information matrix for $p_\eta(y|x)$ with respect to the
reference input distribution $p_\theta(x)$ by

$$ (G_{\eta,\theta})_{ij} := \int\!\!\int \frac{\partial \log p_\eta(y|x)}{\partial \eta_i} \frac{\partial \log p_\eta(y|x)}{\partial \eta_j}\, p_\eta(y|x)\, p_\theta(x)\, d\mu_X(x)\, d\mu_Y(y) . $$

Let $x_1, \dots, x_k$ be drawn from $p_\theta(x)$ and
$(x_{k+1}, y_{k+1}), \dots, (x_{2k}, y_{2k})$ from
$p_{\theta'}(x)p_\eta(y|x)$, where $F_\theta$ and $G_{\eta,\theta'}$ are
non-singular and $\theta$ and $\eta$ are generic in the sense that a
description up to an error $\epsilon$ (in vector norm) requires
$-d \log_2 \epsilon$ or $-\tilde d \log_2 \epsilon$ bits, respectively.

Assume, moreover, that $\theta$ and $\eta$ are related as follows. If
$d \ge \tilde d$, let $f(\theta) = \eta$ for some continuously differentiable
function $f$ with $K(f) \stackrel{+}{=} 0$. For $d < \tilde d$, let
$g(\eta) = \theta$ for some continuously differentiable $g$ with
$K(g) \stackrel{+}{=} 0$.

Then the algorithmic mutual information between the $x$-values sampled
from the original distribution and the $(x,y)$-pairs generated by the modified
input distribution satisfies asymptotically almost surely

$$ I(x_1, \dots, x_k : x_{k+1}, \dots, x_{2k}, y_{k+1}, \dots, y_{2k}) \stackrel{+}{\ge} c \min\{d, \tilde d\} \log_2 k $$

for every $c < 1/2$.

Note that the requirement of "generic" parameter values (in the sense
we used the term) can be met by a model where "nature chooses" them
according to some prior density. Since the statement is only an asymptotic
one, the theorem holds regardless of the prior.

Proof of Theorem 1: Assume first that $d \ge \tilde d$. We define an
estimator $\hat\theta$ for $\theta$ by minimizing the negative loglikelihood

$$ -\sum_{i=1}^{k} \log p_\theta(x_i) . $$

Hence,

$$ \|\hat\theta - \theta\| \le \frac{1}{(k\lambda_\theta)^c} $$

with probability converging to 1 for $k \to \infty$ if $\lambda_\theta$
denotes the smallest eigenvalue of $F_\theta$. This is because
$\sqrt{k}\,(\hat\theta - \theta)$ is asymptotically a $d$-dimensional Gaussian
with concentration matrix $F_\theta$. The standard deviation of the Gaussian
is maximal for the direction corresponding to $\lambda_\theta$ and is then
given by $1/\sqrt{\lambda_\theta}$.
We construct an estimator $\hat\eta$ by minimizing the negative loglikelihood

$$ -\sum_{j=k+1}^{2k} \log p_\eta(y_j|x_j) . $$

Since $G_{\eta,\theta'}$ is non-singular, $p_\eta(y|x)$ is a strict minimum of
the expected negative loglikelihood. As for the unconditional distributions
above, $\sqrt{k}\,(\hat\eta - \eta)$ is asymptotically Gaussian and the
probability for

$$ \|\hat\eta - \eta\| \le \frac{1}{(k\nu_\eta)^c} \tag{17} $$

tends to 1 if $\nu_\eta$ denotes the smallest eigenvalue of
$G_{\eta,\theta'}$.

Denoting the operator norm of the Jacobi matrix $Df(\theta)$ by
$\|Df(\theta)\|$, we obtain

$$ \|f(\hat\theta) - f(\theta)\| \le \|Df(\theta)\|\,\|\hat\theta - \theta\| + o(\|\hat\theta - \theta\|) \le (\|Df(\theta)\| + \delta)\,\|\hat\theta - \theta\| , \tag{18} $$

where the last inequality holds asymptotically almost surely for any
$\delta > 0$. Due to the error bounds (18) and (17), and since
$f(\theta) = \eta$, we have

$$ \epsilon := \|f(\hat\theta) - \hat\eta\| \le \frac{\|Df(\theta)\| + \delta}{(k\lambda_\theta)^c} + \frac{1}{(k\nu_\eta)^c} $$

asymptotically with probability arbitrarily close to 1. Since $\eta$ is a
generic value, the amount of information required to specify it up to an
accuracy $\epsilon$ grows asymptotically with $-\tilde d \log_2 \epsilon$ (up
to some negligible constant). On the other hand, $\hat\eta$ and
$f(\hat\theta)$ share at least this amount of information because they also
coincide up to an accuracy $\epsilon$. Hence,

$$ I(f(\hat\theta) : \hat\eta) \stackrel{+}{\ge} -\tilde d \log_2 \epsilon . $$

Asymptotically, $-\log_2 \epsilon$ grows with $c \log_2 k$. Hence the mutual
information between $\hat\eta$ and $f(\hat\theta)$ is asymptotically larger
than $c\,\tilde d \log_2 k$ bits for every $c < 1/2$. Hence we have

$$ I(x_1, \dots, x_k : x_{k+1}, \dots, x_{2k}, y_{k+1}, \dots, y_{2k}) \stackrel{+}{\ge} I(f(\hat\theta) : \hat\eta) \stackrel{+}{\ge} c\,\tilde d \log_2 k . $$

The first inequality follows because

$$ I(a : b) \stackrel{+}{\ge} I(\tilde a : \tilde b) $$

whenever $K(\tilde a|a) \stackrel{+}{=} K(\tilde b|b) \stackrel{+}{=} 0$
(cf. Theorem II.7 in [26]). Here

$$ K(f(\hat\theta)\,|\,x_1, \dots, x_k) \stackrel{+}{=} 0 $$

because $f(\hat\theta)$ is computed from the $k$ observed $x$-values by the
above estimation procedure and the application of $f$. Likewise, $\hat\eta$
is derived from the observed $(x, y)$-pairs.

The case $d < \tilde d$ is shown similarly. We estimate $\eta$ and $\theta$
and show that they share algorithmic information because $\theta$ is a simple
function of $\eta$.
□
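The estimator error bound used in the proof can be checked empirically in the simplest case. The following is a minimal sketch assuming a Bernoulli family with $p_\theta(x=1) = \theta$, whose Fisher information is $1/(\theta(1-\theta))$; the values of $\theta$, $c$, the sample sizes, and the number of trials are hypothetical choices made here.

```python
import numpy as np

def coverage(theta, k, c, trials, rng):
    """Fraction of trials in which |theta_hat - theta| <= (k * lambda)^(-c)."""
    lam = 1.0 / (theta * (1.0 - theta))   # Fisher information of Bernoulli(theta)
    bound = (k * lam) ** (-c)
    theta_hat = rng.binomial(k, theta, size=trials) / k   # maximum likelihood estimate
    return float(np.mean(np.abs(theta_hat - theta) <= bound))

rng = np.random.default_rng(1)
theta, c = 0.3, 0.4                       # any c < 1/2 is admissible
cov_small = coverage(theta, 100, c, 5000, rng)
cov_large = coverage(theta, 10_000, c, 5000, rng)
print(cov_small, cov_large)
```

For $c < 1/2$ the bound shrinks more slowly than the standard deviation of the estimator, so the fraction of trials satisfying the bound should approach 1 as $k$ grows.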

Now we present our main theorem stating that second order models
between one binary and one real-valued variable induce joint distributions
whose non-causal marginals and conditionals are algorithmically dependent
in the sense of Theorem 1.

Theorem 2 (Justification of second order model inference)

Let $X$ be a binary variable and $Y$ real-valued and the density of $p(x,y)$
be given by a second order model from $Y$ to $X$ for some generic values of
the parameters $\nu, \sigma, \alpha, \beta$ in eq. (0|), left and right. Then
the causal hypothesis $X \to Y$ contradicts the algorithmic Markov condition.
This is because the $x$-values sampled from $p(x)$ contain algorithmic
information about the $(x,y)$-pairs obtained after changing the "input"
distribution $p(x)$ (see Fig\Q right) and keeping $p(y|x)$.

Likewise, if $p(x, y)$ admits a second order model from $X$ to $Y$ with
generic values $\gamma, \nu_0, \nu_1, \rho$ (see eq. (0)), then $Y \to X$ must
be rejected.

The amount of the shared algorithmic information grows at least
logarithmically in the sample size.

The remainder of this section is devoted to the proof of Theorem 2 and
a lemma that is required for this purpose. To show that the conditions
of Theorem 1 are met, we determine the parameter vectors $\theta, \eta$ of the
non-causal conditionals, show that they satisfy a functional relation and that
the Fisher information matrices are non-singular. To prove the latter
statement, we will use the following result:

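Lemma 1 Let $p_\theta(\mathbf{x})$ for all
$\theta \in I \subset \mathbb{R}^d$ be a differentiable family of continuous
positive definite densities on a probability space
$\Omega \subset \mathbb{R}^n$ with respect to the reference measure $\mu$.
Assume there are $d$ points $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_d$
such that the matrix $A(\theta)$ defined by

$$ A(\theta) := (\nabla p_\theta(\mathbf{x}_1), \dots, \nabla p_\theta(\mathbf{x}_d)) , $$

or the matrix

$$ L(\theta) := (\nabla \log p_\theta(\mathbf{x}_1), \dots, \nabla \log p_\theta(\mathbf{x}_d)) $$

is non-singular. Then the Fisher information matrix $F_\theta$ is
non-singular.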

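Proof: The Fisher information matrix can be rewritten as

$$ F_\theta = \int \frac{1}{p_\theta(\mathbf{x})} (\nabla p_\theta(\mathbf{x}))(\nabla p_\theta(\mathbf{x}))^T d\mu(\mathbf{x}) = \int p_\theta(\mathbf{x}) (\nabla \log p_\theta(\mathbf{x}))(\nabla \log p_\theta(\mathbf{x}))^T d\mu(\mathbf{x}) . $$

It thus is the weighted integral over all rank one matrices

$$ (\nabla p_\theta(\mathbf{x}))(\nabla p_\theta(\mathbf{x}))^T . $$

At the same time, it can also be written as a weighted integral over all

$$ (\nabla \log p_\theta(\mathbf{x}))(\nabla \log p_\theta(\mathbf{x}))^T . $$

Note that for any vector-valued continuous function $v$ and strictly positive
scalar function $q$, the image of the matrix

$$ \int q(\mathbf{x})\, v(\mathbf{x})\, v(\mathbf{x})^T d\mu(\mathbf{x}) $$

is given by the span of all $v(\mathbf{x})$. The image of $F_\theta$ thus is
the span over all $\{\nabla p_\theta(\mathbf{x})\}_{\mathbf{x}}$ and, at the
same time, the span over all
$\{\nabla \log p_\theta(\mathbf{x})\}_{\mathbf{x}}$. Hence $d$ linearly
independent gradients imply that $F_\theta$ has full rank. □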
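The statement of Lemma 1 can be checked numerically on a simple family. The sketch below assumes a univariate Gaussian with parameters (mean, standard deviation), approximates the Fisher information by a Riemann sum, and compares it with the gradients of the log-density at two points; the family, the parameter values, the grid, and the two points are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_density(x, theta):
    mu, s = theta
    return -np.log(s) - 0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / (2 * s**2)

def grad_log_density(x, theta):
    """Gradient of log p w.r.t. (mu, s); returns an array of shape (2, len(x))."""
    mu, s = theta
    return np.stack([(x - mu) / s**2, -1.0 / s + (x - mu) ** 2 / s**3])

theta = (0.5, 1.5)                         # hypothetical parameter values
x = np.linspace(theta[0] - 12 * theta[1], theta[0] + 12 * theta[1], 200001)
dx = x[1] - x[0]
p = np.exp(log_density(x, theta))
g = grad_log_density(x, theta)

# Fisher information as a weighted integral over rank-one matrices g g^T.
F = (g * p) @ g.T * dx

# Gradients at d = 2 points; the columns form the second matrix of the lemma.
points = np.array([0.0, 2.0])
L = grad_log_density(points, theta)

print(np.linalg.det(L), np.linalg.eigvalsh(F))
```

Here the two gradients are linearly independent, and the numerically computed Fisher information is positive definite, in line with the lemma.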
We are now able to prove the main theorem:



Proof (of Theorem 2): First consider the case where $p(x,y)$ has a second
order model from $Y$ to $X$. To apply Theorem 1 we have to show that
$G_{\eta,\theta}$ is non-singular. We can use Lemma 1 even though it is not
explicitly stated for conditional densities because we can apply it to the
joint density $p_\eta(\mathbf{x}) := p_\theta(x)p_\eta(y|x)$ for
$\mathbf{x} := (x,y)$ and fixed $\theta$. Then

$$ \nabla \log p_\eta(x,y) = \nabla \log p_\eta(y|x) , $$

and

$$ \nabla p_\eta(x, y) = p_\theta(x)\, \nabla p_\eta(y|x) , $$

i.e., it is sufficient to check whether the gradients of the conditional or
its logarithm span a $\tilde d$-dimensional space. We have



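$$ p_\eta(x = 1, y) = \frac{1}{2\sqrt{2\pi}\,\sigma} \left(1 + \tanh(\alpha y + \beta)\right) e^{-\frac{(y-\nu)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}\,\sigma}\, \frac{e^{-\frac{(y-\nu)^2}{2\sigma^2}}}{1 + e^{-2\alpha y - 2\beta}} , $$

where we have used eq. (7). This yields

$$ p_\eta(x = 1) = \frac{1}{\sqrt{2\pi}\,\sigma} \int \frac{e^{-\frac{(y-\nu)^2}{2\sigma^2}}}{1 + e^{-2\alpha y - 2\beta}}\, dy . \tag{19} $$

Introducing the parameter vector $\eta := (\sigma, \nu, \alpha, \beta)$ we
obtain

$$ p_\eta(y|x = 1) = \frac{1}{p_\eta(x = 1)\, \sigma \sqrt{2\pi}}\, \frac{e^{-\frac{(y-\nu)^2}{2\sigma^2}}}{1 + e^{-2\alpha y - 2\beta}} , $$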

where the input distribution $p(x)$ still is formally parameterized by $\eta$
and will be written in terms of one relevant parameter $\theta$ below. In the
appendix we provide 4 points $y_1, \dots, y_4$ and a value $\eta = \eta_0$ for
which the vectors $\nabla p_\eta(y_j|x = 1)$ are linearly independent. Hence
$G_{\eta,\theta}$ is non-singular for $\eta_0$ and all $\theta$. All entries
of $G_{\eta,\theta}$ are analytic functions in every component of $\eta$
because they are uniformly converging integrals over analytic functions.
Hence, regularity of $G_{\eta,\theta}$ for one $\eta$ already shows regularity
for generic $\eta$. Now we parameterize $p(x)$ by a one-dimensional parameter

$$ \theta := p_\eta(x = 1) = g(\eta) , $$

where $p_\eta(x)$ is given by the integral in eq. (19). This defines the
family of densities $p_\theta(x)$ via

$$ p_\theta(x = 1) := \theta . $$

Hence $F_\theta$ is one-dimensional. It is clearly non-singular for generic
$\theta$ because

$$ \frac{\partial p_\theta(x = 1)}{\partial \theta} = 1 . $$

Using $K(g) \stackrel{+}{=} 0$, Theorem 1 shows that the $x$-values sampled
from $p_\theta(x)$ share algorithmic information with the $(x, y)$-pairs
sampled from $p_{\theta'}(x)p_\eta(y|x)$.

Now consider the case that there is a second order model from $X$ to $Y$.
Hence

$$ p_\theta(y) = \frac{1}{\sqrt{2\pi}\,\rho} \left[ (1 - \gamma)\, e^{-\frac{(y-\nu_0)^2}{2\rho^2}} + \gamma\, e^{-\frac{(y-\nu_1)^2}{2\rho^2}} \right] $$

with the parameter vector $\theta = (\gamma, \nu_0, \nu_1, \rho)$. Note that
we now apply Theorem 1 with the roles of $X$ and $Y$ exchanged. To show that
$F_\theta$ is non-singular we compute $\nabla p_\theta(y)$ and find points
$y_1, \dots, y_4$ and a value $\theta_0$ such that the corresponding gradients
are linearly independent (see Appendix 7). Hence $F_\theta$ is non-singular
due to Lemma 1. As above, this also holds for generic $\theta$.

For the conditional density of $X$ given $Y$, only a function of $\theta$ is
relevant (as above) but we start by writing it first in terms of $\theta$ and
reduce the parameter space later to the relevant part:

$$ p_{\gamma,\nu_0,\nu_1,\rho}(x = 1|y) = \frac{\gamma\, e^{-\frac{(y-\nu_1)^2}{2\rho^2}}}{(1 - \gamma)\, e^{-\frac{(y-\nu_0)^2}{2\rho^2}} + \gamma\, e^{-\frac{(y-\nu_1)^2}{2\rho^2}}} . $$

Introducing

$$ \alpha := \frac{1}{\rho^2} (\nu_0 - \nu_1) \tag{20} $$

and

$$ \beta := \frac{1 - \gamma}{\gamma}\, e^{\frac{\nu_1^2 - \nu_0^2}{2\rho^2}} \tag{21} $$

the conditional is of the form

$$ p_{\alpha,\beta}(x = 1|y) = \frac{1}{\beta e^{\alpha y} + 1} . $$

We define $\eta := (\alpha, \beta)$ and check that $G_{\eta,\theta}$ is
non-singular. For doing so, we compute $\nabla p_\eta(x = 1|y)$ and find
values $\eta_0$ and $y_1, y_2$ such that the gradients are linearly
independent (Appendix). Hence $G_{\eta,\theta}$ is non-singular for one
$\eta_0$ and all $\theta$ and thus also for generic pairs $\eta, \theta$. The
function $g$ is given by $g(\gamma, \nu_0, \nu_1, \rho) := (\alpha, \beta)$
with $\alpha$ and $\beta$ as in eqs. (20) and (21), which satisfies
$K(g) \stackrel{+}{=} 0$. This shows that the $y$-values sampled from
$p_\theta(y)$ share algorithmic information with the $(x,y)$-pairs sampled
from $p_{\theta'}(y)p_\eta(x|y)$ by Theorem 1. □
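The reduction of the conditional of $X$ given $Y$ to a two-parameter sigmoid can be verified numerically. In this sketch the expression for $\alpha$ follows eq. (20); the expression for $\beta$ is the one obtained here by dividing numerator and denominator of the mixture posterior by its numerator, so it should be read as a derivation made for this check rather than a quotation. All parameter values are hypothetical.

```python
import numpy as np

def posterior_mixture(y, gamma, nu0, nu1, rho):
    """p(x = 1 | y) for a two-component Gaussian mixture with common width rho."""
    w0 = (1 - gamma) * np.exp(-(y - nu0) ** 2 / (2 * rho**2))
    w1 = gamma * np.exp(-(y - nu1) ** 2 / (2 * rho**2))
    return w1 / (w0 + w1)

def g(gamma, nu0, nu1, rho):
    """Map theta = (gamma, nu0, nu1, rho) to eta = (alpha, beta)."""
    alpha = (nu0 - nu1) / rho**2
    beta = (1 - gamma) / gamma * np.exp((nu1**2 - nu0**2) / (2 * rho**2))
    return alpha, beta

def posterior_sigmoid(y, alpha, beta):
    return 1.0 / (beta * np.exp(alpha * y) + 1.0)

theta = (0.35, -0.4, 1.1, 0.9)            # hypothetical (gamma, nu0, nu1, rho)
y = np.linspace(-4.0, 4.0, 81)
alpha, beta = g(*theta)
max_gap = float(np.max(np.abs(posterior_mixture(y, *theta) - posterior_sigmoid(y, alpha, beta))))
print(alpha, beta, max_gap)
```

The two expressions agree up to floating-point error, which illustrates that $\eta$ is a simple function of $\theta$.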
5 Experiments



We conducted 8 experiments with real-world data for which the causal
structure is known. In all cases we had pairs of variables where one is the
cause and one the effect. Even though there may also be hidden common causes,
prior knowledge strongly suggests that a significant part of the dependences
is due to an arrow from one variable to the other. The selection of datasets
was based on the following criteria: We have chosen several examples where
one variable is binary and the other one is either continuous or discrete with
a wide range, because this is the case where identifiability becomes most
obvious (see Subsection 3.1). To demonstrate that we have identifiability
for various types of value sets we have also included an example with a
variable of angular type and an example with positive variables. The
restriction to positive values, however, only leads to significantly different
distributions for different causal directions if there is enough probability
close to the boundary. Otherwise, the second order models yield almost
bivariate Gaussians and the direction is not identifiable. Most examples of
the database "cause effect pairs" in the NIPS 2008 causality competition [27]
are of this type, except for the examples with "altitude".

Our algorithm constructs the domains by binning the observed values
into intervals of equal length instead of asking for the range as additional
input. If the differences of the loglikelihoods are too small, our algorithm
will not decide for either of the causal directions. We have set the threshold
to

$$ |L_{\to} - L_{\leftarrow}| \le \frac{1}{10000} \cdot \frac{L_{\to} + L_{\leftarrow}}{2} . $$

The choice of this threshold, however, is the result of our limited number of
experiments. Our theory in Section 4 only states the following: if the true
distribution perfectly coincides with a second order model in one direction
but not the other, the latter one has to be rejected because this causal
structure would require unlikely adjustments. For the case where the
distribution is only close to a second order model it is hard to analyze how
close it should be to justify our causal conclusion. The answer to this
question is left to the future.
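The decision rule with this threshold can be written down in a few lines. This is a sketch only: the function name and the return labels are hypothetical, and it assumes that $L_\to$ and $L_\leftarrow$ are loss-like scores for which the smaller value indicates the preferred direction, which the text above does not state explicitly.

```python
def decide(l_forward, l_backward, rel_threshold=1e-4):
    """Return the preferred direction, or "undecided" if the scores are too close.

    Assumption: smaller scores are better (e.g. negative loglikelihoods).
    """
    gap = abs(l_forward - l_backward)
    if gap <= rel_threshold * (l_forward + l_backward) / 2:
        return "undecided"
    return "forward" if l_forward < l_backward else "backward"

# Example calls with score pairs of the kind listed in the results table.
print(decide(3.3697, 3.4366))
print(decide(4.9918, 4.9920))
```

The relative threshold makes the rule invariant under a common rescaling of both scores.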



Meteorological data 

Experiment No. 1 considers the altitude and average temperature of 675
locations in Germany [28]. The statistical dependence between both variables
is very obvious from scatter plots and one observes an almost linear decrease
of the temperature with increasing altitude. The fact that a significant part
of the points are close to altitude 0 (i.e., the minimal value) is important
for identifiability of the causal direction because the restriction of the
domain to positive values can only be relevant in this case.

Experiment No. 2 studies the relation between altitude and precipitation
of 4748 locations in Germany [28]. Here both variables are positive-valued,
which also leads to different models in the two directions.

In experiment No. 3, we were given the daily temperature averages of
9162 consecutive days between 1979 and 2004 in Furtwangen, Germany [29].
The seasonal cycle leads to a strong statistical dependence between the
variable day in the year (represented as a point on the unit circle
$S^1 \subset \mathbb{R}^2$) and temperature, where the former should be
considered as the cause since it describes the position of the earth on its
orbit around the sun.

Human categorization 

Our experiments No. 4 and No. 5 consider two datasets from the same
psychological experiment on human categorization. The subjects are shown
artificially generated faces that interpolate between male and female faces
[30]. The interpolation corresponds to switching a parameter between 1 and
15 (in integer steps). The subjects are asked to decide whether the face is
male (answer=0) or female (answer=1). The experimentalist has chosen
parameter values according to a uniform distribution on {1, ..., 15}.

No. 4 studies the relation between parameter and answer. Since the
experimentalist chose a uniform distribution over {1, ..., 15} and the
dependence of the probability for answer=1 is close to a sigmoid function, the
empirical distribution is here very close to the second order model
corresponding to the correct causal structure parameter → answer.

Our experiment No. 5 studies the relation between the response time and 
the parameter values. Since the response time is minimal for both extremes 
in the parameter values, we have strongly non-linear interactions that cannot 
be captured by second-order models. It is therefore not surprising that there 
is no decision in this case. 

Census data 

Experiments No. 6 and No. 7 consider census data from 35,326 persons
in the USA [31]. In No. 6, the relation between age and marital status
is studied. The latter takes the value 0 for never married and 1 for
married, divorced, or widowed. No. 7 considers the relation between gender
and income. Here we assume that the gender is almost randomized by nature
and we thus expect no confounding with any observable variable.

Constituents of wine 

Experiment No. 8 considers the concentration of proline in wine from two 
different cultivars. We assume that the binary variable cultivar is the 
cause, even though one cannot exclude that the proline level (if relevant for 
the taste) directly influenced the decision of the cultivar to choose this sort 
of wine. 

List of results 

The results are shown in the table below. The ground truth is always that
variable 1 influences variable 2, i.e., we have one wrong result and no
decision in two cases.



No.  variable 1, domain        variable 2, value set     L_→      L_←      result
1    altitude, R+              temperature, R            3.3697   3.4366
2    altitude, R+              precipitation, R+         3.5885   3.6343
3    day of the year, S^1      temperature, R            5.7448   5.7527
4    parameter, {1,...,15}     answer, {0,1}             4.1143   3.1150
5    parameter, {1,...,15}     time, R+                  3.9873   3.9873   ?
6    age, R+                   marital status, {0,1}     4.9918   4.9920   ?
7    sex, {0,1}                income, R+                3.8770   3.8758
8    cultivar, {0,1}           proline, R+               3.9209   3.9496





6 Discussion and relations to independence-based 
causal inference 

In Section 4 we have shown for a special case that the model $X \to Y$ must
be rejected if there is a second order model from $Y$ to $X$ because it would
require specific mutual adjustments of $p(x)$ and $p(y|x)$ to admit such a
model. We have already mentioned that this is the same idea as rejecting
unfaithful distributions. Indeed, [15] argued that the Markov kernels in
unfaithful distributions share algorithmic information. Hence algorithmic
information theory provides a unifying framework for independence-based
approaches and those that impose constraints on the shapes of conditional
densities.

The following example makes this link even closer because it shows that
in some situations the same constraints on a joint distribution may appear
as independence constraints from one point of view and as constraints on the
shape of conditionals from another perspective.

Figure 5: Two layers in the causal chain. If the components are only
influenced by horizontally adjacent ones from the layer above, the Markov
condition further simplifies the forward time conditional.

Consider the causal chain

$$ X_1 \to X_2 \to \cdots \to X_n , \tag{22} $$

where every $X_j$ is a vector of dimension $d$. Structures of this kind occur,
for instance, if $X_j$ represents the state of some system at time $t$ and the
dynamics is generated by a first order Markov process. Due to the causal
Markov condition the joint distribution factorizes into

$$ p(x_1)\, p(x_2|x_1) \cdots p(x_n|x_{n-1}) , $$

but no constraints are imposed on the conditionals $p(x_j|x_{j-1})$.

Assume now we consider each component $X_j^{(i)}$ of the layer variable $X_j$
in its own right and thus obtain a causal structure between $\tilde n := nd$
variables. Assuming that no component $X_j^{(i)}$ is influenced by components
of the same layer, $p(x_j|x_{j-1})$ must be of the form

$$ p(x_j|x_{j-1}) = \prod_{i=1}^{d} p(x_j^{(i)}|x_{j-1}) . \tag{23} $$

Moreover, if we assume that every $X_j^{(i)}$ is only influenced by some of
the variables in the previous layer, the conditional further simplifies into

$$ p(x_j|x_{j-1}) = \prod_{i=1}^{d} p(x_j^{(i)}|pa_j^{(i)}) , $$

where $pa_j^{(i)}$ denote the values of $PA_j^{(i)}$, i.e., the parents of
$X_j^{(i)}$ (Fig. 5). Hence, the fine-structure of the causal graph imposes
constraints on $p(x_j|x_{j-1})$ that are not imposed by the coarse-grained
structure.

Assume we are given data from the above time series, but it is not known
whether the true causal structure reads

$$ X_n \to X_{n-1} \to \cdots \to X_1 $$

or the one in (22). When resolving the vectors in their components, we have
to reject the former hypothesis because the independences

$$ X_j^{(i)} \perp\!\!\!\perp X_j^{(i')} \mid X_{j-1} \qquad (i \ne i') \tag{24} $$

would violate faithfulness. From the coarse-grained perspective, we can only
reject the causal hypothesis by imposing an appropriate simplicity principle
on conditionals. Then we conclude that (22) is more likely to be true because
the Markov kernels are simpler, since they satisfy eq. (24). Finding further
useful simplicity constraints has to be left to the future.
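The asymmetry between the two time directions can be seen in a small simulation. This sketch assumes a linear Gaussian chain with three layers of dimension 2 and independent noise components; the mixing matrix, the sample size, and the use of residual correlations after linear regression as a proxy for conditional dependence are all choices made here for illustration.

```python
import numpy as np

def residual_corr(target, given):
    """Correlation between the two components of `target` after regressing on `given`."""
    design = np.column_stack([given, np.ones(len(given))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(np.corrcoef(resid[:, 0], resid[:, 1])[0, 1])

rng = np.random.default_rng(2)
n = 20000
A = np.array([[0.8, 0.5], [0.3, 0.9]])    # hypothetical layer-to-layer influence

# Forward chain X1 -> X2 -> X3 with independent noise per component.
x1 = rng.normal(size=(n, 2))
x2 = x1 @ A.T + rng.normal(size=(n, 2))
x3 = x2 @ A.T + rng.normal(size=(n, 2))

forward = residual_corr(x2, x1)    # components of X2 given the previous layer
backward = residual_corr(x2, x3)   # components of X2 given the following layer
print(forward, backward)
```

Given the previous layer the components of a layer are (nearly) uncorrelated, whereas given the following layer they remain clearly dependent, so only the forward factorization has the simple product form.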



7 Appendix 

7.1 Matrix for p(y|x)

We have:

$$ \log p_{\sigma,\nu,\alpha,\beta}(y|x = 1) = -\log \sigma - \log \sqrt{2\pi} - \frac{(y - \nu)^2}{2\sigma^2} - \log\left(e^{-2\alpha y - 2\beta} + 1\right) - \log p_\eta(x = 1) . $$

Hence

$$ \frac{\partial \log p_\eta(y|x=1)}{\partial \nu} = \frac{y - \nu}{\sigma^2} - \frac{\partial \log p_\eta(x=1)}{\partial \nu} =: h_1(y) $$

$$ \frac{\partial \log p_\eta(y|x=1)}{\partial \sigma} = -\frac{1}{\sigma} + \frac{(y - \nu)^2}{\sigma^3} - \frac{\partial \log p_\eta(x=1)}{\partial \sigma} =: h_2(y) $$

$$ \frac{\partial \log p_\eta(y|x=1)}{\partial \alpha} = 2y\, \frac{e^{-2\alpha y - 2\beta}}{e^{-2\alpha y - 2\beta} + 1} - \frac{\partial \log p_\eta(x=1)}{\partial \alpha} =: h_3(y) $$

$$ \frac{\partial \log p_\eta(y|x=1)}{\partial \beta} = 2\, \frac{e^{-2\alpha y - 2\beta}}{e^{-2\alpha y - 2\beta} + 1} - \frac{\partial \log p_\eta(x=1)}{\partial \beta} =: h_4(y) . $$

Intuitively, it is quite evident that the functions $h_j$ are linearly
independent for generic $\eta$ because $h_1$ contains linear terms in $y$,
$h_2$ is a polynomial of degree two in $y$, $h_3$ contains $y$ and an
expression with an exponential function in the denominator, and $h_4$ contains
only the exponential expression in the denominator. We can thus find points
$y_1, \dots, y_4$ such that the row vectors
$(h_j(y_1), h_j(y_2), h_j(y_3), h_j(y_4))$ are linearly independent. Instead
of proving this directly (which would involve derivatives of the logarithms of
marginals), we use the following indirect argument: Choose 5 values
$y'_0, \dots, y'_4$ and consider the rows

$$ [h_j(y'_0), h_j(y'_1), \dots, h_j(y'_4)] \qquad j = 1, \dots, 4 . \tag{25} $$

Consider the projection of $\mathbb{R}^5$ onto the quotient space
$\mathbb{R}^5 / \mathbb{R}(1, 1, 1, 1, 1)$ and represent the images of the
vectors (25) by

$$ [h_j(y'_1) - h_j(y'_0), h_j(y'_2) - h_j(y'_0), \dots, h_j(y'_4) - h_j(y'_0)] \qquad j = 1, \dots, 4 . $$

Check that these 4 vectors are linearly independent (which can fortunately
be done without computing the derivatives of
$\log p_{\sigma,\nu,\alpha,\beta}(x = 1)$, because these terms do not depend
on $y$ and cancel in the differences), hence the row vectors (25) are
independent, too. We can thus select 4 values $y_1, \dots, y_4$ from
$y'_0, \dots, y'_4$ such that the rows $(h_j(y_1), \dots, h_j(y_4))$ are
independent. We have numerically checked this for
$\nu = \sigma = \alpha = \beta = 1$, $y'_j = j$.
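The numerical check mentioned above can be reproduced along the following lines. This is a sketch, not the authors' code: it assumes that the conditional of $x = 1$ given $y$ is $(1 + \tanh(\alpha y + \beta))/2$, omits the derivatives of the log-marginal (they are constant in $y$ and drop out of the differences), and uses the parameter values and points quoted in the text.

```python
import numpy as np

def h_without_marginal_terms(y, nu, sigma, alpha, beta):
    """y-dependent parts of the four partial derivatives of log p(y | x = 1)."""
    # e^{-z} / (e^{-z} + 1) with z = 2*alpha*y + 2*beta, written in a stable form
    s = 1.0 / (1.0 + np.exp(2 * alpha * y + 2 * beta))
    return np.stack([
        (y - nu) / sigma**2,                       # derivative w.r.t. nu
        -1.0 / sigma + (y - nu) ** 2 / sigma**3,   # derivative w.r.t. sigma
        2 * y * s,                                 # derivative w.r.t. alpha
        2 * s,                                     # derivative w.r.t. beta
    ])

y_prime = np.arange(5, dtype=float)                # y'_j = j for j = 0, ..., 4
H = h_without_marginal_terms(y_prime, 1.0, 1.0, 1.0, 1.0)   # shape (4, 5)

# Differences relative to y'_0 represent the rows in the quotient space.
D = H[:, 1:] - H[:, [0]]                           # shape (4, 4)
rank = int(np.linalg.matrix_rank(D))
print(rank, np.linalg.det(D))
```

A rank of 4 for the difference matrix confirms that the four rows in (25) are linearly independent for this choice of parameters.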

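7.2 Matrix for p(y)

The coefficients of $\nabla p_\theta(y)$ read:

$$ \frac{\partial p_\theta(y)}{\partial \gamma} = \frac{1}{\sqrt{2\pi}\,\rho} \left[ -e^{-\frac{(y-\nu_0)^2}{2\rho^2}} + e^{-\frac{(y-\nu_1)^2}{2\rho^2}} \right] $$

$$ \frac{\partial p_\theta(y)}{\partial \nu_0} = \frac{1}{\sqrt{2\pi}\,\rho}\, \frac{(1-\gamma)(y-\nu_0)}{\rho^2}\, e^{-\frac{(y-\nu_0)^2}{2\rho^2}} $$

$$ \frac{\partial p_\theta(y)}{\partial \nu_1} = \frac{1}{\sqrt{2\pi}\,\rho}\, \frac{\gamma (y-\nu_1)}{\rho^2}\, e^{-\frac{(y-\nu_1)^2}{2\rho^2}} $$

$$ \frac{\partial p_\theta(y)}{\partial \rho} = -\frac{1}{\rho}\, p_\theta(y) + \frac{1}{\sqrt{2\pi}\,\rho} \left[ (1-\gamma) \frac{(y-\nu_0)^2}{\rho^3}\, e^{-\frac{(y-\nu_0)^2}{2\rho^2}} + \gamma \frac{(y-\nu_1)^2}{\rho^3}\, e^{-\frac{(y-\nu_1)^2}{2\rho^2}} \right] $$

The vectors are linearly independent for the points $y_j = j$ for
$j = 1, \dots, 4$ with $\gamma = 1/2$, $\nu_0 = 0$, $\nu_1 = 1$, $\rho = 1$.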
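The linear independence claimed for these points can be checked directly. The sketch below assumes the Gaussian mixture density with common width used in the proof of Theorem 2 and computes the gradient with respect to the four parameters in closed form; the function name is hypothetical.

```python
import numpy as np

def grad_p(y, gamma, nu0, nu1, rho):
    """Gradient of the mixture density w.r.t. (gamma, nu0, nu1, rho); shape (4, len(y))."""
    c = 1.0 / (np.sqrt(2 * np.pi) * rho)
    e0 = np.exp(-(y - nu0) ** 2 / (2 * rho**2))
    e1 = np.exp(-(y - nu1) ** 2 / (2 * rho**2))
    p = c * ((1 - gamma) * e0 + gamma * e1)
    return np.stack([
        c * (e1 - e0),
        c * (1 - gamma) * (y - nu0) / rho**2 * e0,
        c * gamma * (y - nu1) / rho**2 * e1,
        -p / rho + c * ((1 - gamma) * (y - nu0) ** 2 * e0
                        + gamma * (y - nu1) ** 2 * e1) / rho**3,
    ])

y_points = np.array([1.0, 2.0, 3.0, 4.0])          # y_j = j
M = grad_p(y_points, 0.5, 0.0, 1.0, 1.0)           # columns are the gradients
rank = int(np.linalg.matrix_rank(M))
print(rank, np.linalg.svd(M, compute_uv=False))
```

A rank of 4 means that the four gradients span the whole parameter space, which is the condition required by Lemma 1.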

7.3 Matrix for p(x|y)

Introducing the function $h$ with

$$ h_y(\eta) := \beta e^{\alpha y} + 1 $$

we have

$$ p_\eta(x = 1|y) = 1 / h_y(\eta) $$

and thus

$$ \nabla p_\eta(x = 1|y) = -\frac{1}{h_y^2(\eta)}\, \nabla h_y(\eta) , $$

with

$$ \frac{\partial h_y(\eta)}{\partial \alpha} = y \beta e^{\alpha y} \quad \text{and} \quad \frac{\partial h_y(\eta)}{\partial \beta} = e^{\alpha y} . $$

Since the functions $h_1(y) := y \beta e^{\alpha y}$ and
$h_2(y) := e^{\alpha y}$ are linearly independent for generic $\alpha, \beta$,
we can obviously find values $y_1, y_2$ such that
$\nabla p_\eta(x = 1|y_1)$ and $\nabla p_\eta(x = 1|y_2)$ are linearly
independent.